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Abstract 

This  paper  is  concerned  with  scattered  data  approximation  in  high  dimensions: 
Given  a  data  set  X  C  of  N  data  points  xl  along  with  values  yl  £  M.d  ,  i  =  1 , ,N, 
and  viewing  the  yl  as  values  yl  =  /( xl)  of  some  unknown  function  /,  we  wish  to  return 
for  any  query  point  x  £  an  approximation  f(x)  to  y  =  f(x).  Here  the  spatial 
dimension  d  should  be  thought  of  as  large.  We  wish  to  emphasize  that  we  do  not 
seek  a  representation  of  /  in  terms  of  a  fixed  set  of  trial  functions  but  define  /  through 
recovery  schemes  which,  in  the  first  place,  are  designed  to  be  fast  and  to  deal  efficiently 
with  large  data  sets.  For  this  purpose  we  propose  new  methods  based  on  what  we  call 
sparse  occupancy  trees  and  partitioning  schemes  based  on  simplex  subdivisions. 

1  Introduction 

Methods  for  high- dimensional  function  approximation  and  statistical  learning  are  com¬ 
monly  categorized  into  two  classes.  For  an  overview  we  refer  to  [HTF09].  On  the  one 
hand  we  have  parametric  methods  which  try  to  fit  the  function  globally,  typically  by 
prescribing  the  structure  of  the  approximant  by  defining  a  set  of  trial  functions  and 
learning  the  coefficients  in  this  approach  by  optimizing  some  error  norm.  Here  one 
can  think  for  example  of  generalized  additive  models,  projection  pursuit  or  artificial 
neural  networks.  Although  these  methods  have  been  applied  successfully  in  a  large 
number  of  applications  they  also  have  some  drawbacks.  First,  the  class  of  functions 
that  are  approximated  well  by  such  techniques  is  typically  small  and  the  right  model 
has  to  be  determined  a  priori.  Second  the  training  stage  usually  involves  the  solution 
of  a  non-linear  optimization  problem  which  might  be  a  demanding  and  time-consuming 
process  effectively  limiting  the  size  of  the  data  that  can  be  handled.  Furthermore  these 
approximations  cannot  easily  be  adjusted  to  new  data,  for  example  in  the  case  of  in¬ 
cremental  online  learning  or  applications  where  the  domain  in  which  the  function  is  to 
be  evaluated  changes  in  time. 

On  the  other  side  of  the  spectrum  there  are  non-parametric  methods  which  try 
to  fit  functions  locally,  usually  by  partitioning  the  input  space  and  then  using  simple 
local  models  like  piecewise  constant  approximations.  The  idea  of  being  content  with 
piecewise  constants  is  supported  by  classical  concentration  of  measure  results  according 
to  which  a  well-behaved  function  (e.g.  Lipschitz-continuous)  in  very  high-dinrensions 
deviates  from  its  mean  or  median  by  much  only  on  sets  of  small  measure. 
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A  typical  example  for  such  a  recovery  strategy  is  to  determine  for  any  given  query 
point  its  k  nearest  neighbors  in  the  given  data  site  and  to  use  their  average  as  the 
approximate  function  value.  On  first  glance  this  kind  of  memory-based  learning  does 
not  seem  to  require  any  training  process  except  of  reading  and  storing  the  incoming 
data.  In  practice,  however,  it  is  necessary  to  design  a  data  structure  that  provides  a 
fast  solution  to  the  question,  which  are  the  nearest  neighbors  of  a  query  point  x.  Un¬ 
fortunately,  the  exact  solution  requires  either  a  preprocessing  time  which  is  exponential 
in  d  or  a  single  query  time  which  is  linear  in  N,  the  latter  characterizing  the  brute  force 
algorithm  where  the  distance  ||x*  —  x\\  is  computed  for  each  training  point.  Actually, 
for  function  recovery  purposes  one  would  also  be  satisfied  with  an  approximate  solution 
which  can  be  achieved  much  more  efficiently.  Here  we  refer  to  [Ind04,  LMGY05]  which 
give  some  of  the  basic  ideas  about  the  algorithms  which  could  be  useful  in  this  con¬ 
text.  However,  none  of  the  currently  available  methods  seems  to  be  very  performant 
if  d  goes  into  the  hundreds  and  N  into  the  millions.  An  application  background  from 
climatology  in  which  problems  of  this  size  arise  will  be  described  in  a  subsequent  report 
[BBD+10]. 

Therefore,  in  situations  where  a  fast  return  to  a  query  of  a  function  evaluation 
matters,  such  as  when  approximate  function  values  are  required  in  discretizations  of 
PDEs,  say,  alternative  strategies  may  be  preferable.  In  view  of  these  considerations  we 
develop  and  investigate  in  this  paper  some  methods  which  in  the  first  place  are  designed 
to  be  fast  and  to  deal  efficiently  with  large  data  sets  and  to  provide  fast  algorithms 
for  evaluation.  The  motivation  behind  our  approach  is  to  explore  the  potential  of 
multiresolution  ideas  for  high  spatial  dimension. 

In  the  last  two  decades  multilevel  methods  have  proved  to  be  essential  for  ap¬ 
proximating  functions  with  inhomogeneous  local  structural  properties  and  have  been 
successfully  applied  in  different  forms  in  several  areas  ranging  from  image  processing 
to  solutions  of  partial  differential  equations.  A  central  ingredient  of  multiresolution 
analysis  is  the  tree  which  describes  the  relations  between  levels  of  resolution.  In  a 
standard  setting  the  initial  domain  H  is  related  to  a  cube  which  is  the  root  of  the  tree. 
Then,  using  consecutive  dyadic  partitions  one  can  define  different  levels  of  resolution 
and  build  the  corresponding  tree  structure  level  by  level.  To  this  end,  as  a  key  tool 
for  data  organization  we  propose  the  notion  of  sparse  occupancy  trees.  The  underlying 
concepts,  perhaps  with  a  different  terminology,  have  been  certainly  used  in  somewhat 
different  contexts  such  as  nearest  neighbor  search.  To  our  knowledge  its  use  in  recov¬ 
ery  procedures  seems  to  be  new.  Instead  of  using  the  full  tree  T  (called  the  master 
tree),  we  consider  only  a  subtree  T{X)  whose  nodes  correspond  to  the  cubes  that  are 
occupied,  i.e.  contain  at  least  one  element  from  the  set  X.  A  special  indexing  and 
ordering  of  these  cubes  allows  us  to  store  of  all  the  information  about  the  tree  using 
only  O(LdN)  bits  where  L  is  a  chosen  upper  limit  for  the  number  of  levels  in  the  tree. 
The  sparse  occupancy  tree  can  then  be  used  as  a  tool  for  constructing  approximations 
to  the  function  represented  by  the  data. 

It  is  important  to  mention  that  we  are  not  considering  the  sparse  occupancy  trees 
and  the  problem  of  nearest  neighbors  as  separate  issues.  We  want  to  blend  them  in 
such  a  way  that  the  resulting  solution  will  give  efficient  and  reliable  recovery  schemes 
for  a  for  a  variety  of  problems  of  the  above  type.  In  this  context  we  have  to  mention, 
that  one  can  hardly  expect  a  good  performance  of  the  nearest  neighbor  scheme  our  our 
schemes  if  d  is  very  large  and  the  rc-data  is  distributed  uniformly  in  Rrf.  In  this  case 
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the  average  distance  between  two  data  points  would  be  large  even  for  huge  data  sets, 
and  any  method  based  on  localization  strategies  would  be  doomed  to  fail.  However, 
in  practical  problems,  the  input  variables  are  often  strongly  correlated  and  the  true 
intrinsic  dimensionality  of  the  data  is  much  lower  than  the  formal  dimension  of  the 
problem.  Multiresolution  trees  seem  in  particular  suited  to  capture  such  coherent 
structures  and,  hence,  to  mitigate  the  curse  of  dimensionality. 

In  Section  2  we  introduce  a  most  simple  algorithm  providing  a  piecewise  constant 
approximation.  For  a  given  point  x  €  X  it  finds  the  finest  cube  K(x)  from  T (X)  to 
which  it  belongs.  Then  the  approximated  value  at  x  could  be  set  to  be  the  mean  value 
of  the  points  from  K(x)  fi  X .  The  time  for  a  single  query  for  the  node  of  the  sparse 
tree  that  corresponds  to  any  point  from  the  domain  fl  will  be  seen  to  be  (9 (log  N) .  A 
detailed  study  of  this  case  and  numerical  tests  provide  useful  insight  concerning  the 
following  issues.  First,  the  quality  of  this  approximation  depends  significantly  on  the 
size  of  the  cube  K(x)  which  could  be  large  even  if  there  are  points  from  X  close  to  x. 
The  latter  is  subject  to  the  way  the  partition  is  set.  This  is  the  case  when  the  geodesic 
distance  is  much  smaller  than  the  distance  in  the  tree.  Of  course,  ignoring  the  function 
values  in  a  neighboring  cube  which  as  a  node  in  the  tree  is  far  from  the  cube  holding 
the  query  point  x,  is  likely  to  lead  to  highly  inappropriate  assignments  of  approximate 
function  values.  This  observation  will  guide  several  attempts  to  improve  upon  the 
above  described  basic  strategy.  Since  the  evaluation  process  is  very  fast,  a  first  idea 
is  to  generate  several  partitions  of  the  same  data  by  randomly  shifting  the  partition 
boundaries  and  to  use  the  weighted  averages  of  the  corresponding  approximations.  It 
will  be  shown  that  such  an  approach  indeed  has  a  significant  effect. 

In  Section  3  we  extend  this  technology  to  construct  piecewise  linear  approxima¬ 
tions  on  simplex  subdivisions.  Working  with  simplices  offers  a  number  of  advantages 
which  is  a  common  experience  in  numerical  grid  generation  even  in  lower  dimensions. 
Therefore  it  is  a  little  bit  surprising  that  one  hardly  finds  references  discussing  simplex 
partitioning  methods  in  high  dimensions.  The  elementary  observation  behind  this  is 
that  a  d-dimensional  simplex  has  only  d  +  1  vertices  compared  to  the  2d  vertices  of 
a  hyper-cube.  Prescribing  values  for  the  vertices  one  can  define  piecewise  linear  ap¬ 
proximations  as  demonstrated  in  Section  3.  The  values  at  the  vertices  are  defined  as 
weighted  averages  of  the  points  in  the  surrounding  simplices.  The  value  of  the  query 
point  is  found  by  interpolating  vertex  values  of  the  simplex  the  query  falls  into.  This 
way  the  the  query  response  becomes  an  average  of  all  training  points  in  the  neighbor¬ 
hood  of  the  query  coordinates,  even  of  the  points  to  which  the  tree  distance  is  large. 
Indeed  we  will  show  by  numerical  experiments  that  this  scheme  gives  an  accuracy  like 
the  nearest  neighbor  approximation,  but  much  faster.  All  these  algorithms  are  tested 
in  the  numerical  experiments  of  Section  4. 


2  Sparse  Occupancy  Trees 

In  this  section  we  describe  the  general  construction  of  sparse  occupancy  trees  and 
propose  efficient  data  structures  for  their  practical  realization.  Then  we  will  use  this 
construction  to  design  a  very  fast  recovery  scheme  based  on  cube  subdivision. 
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2.1  Basic  Form  and  Piecewise  Constant  Approximation 

Let  us  assume  that  the  data  set  X  is  contained  in  a  bounded  domain  P  C  Md.  Suppose, 
that  we  have  a  a  hierarchy  of  nested  partitions  of  P: 

{0}  =  Vo  ~<  V\  Vj  -<.. .  , 

which  means  that  for  all  Z  >  0  the  sets  Vi  =  {P^.,  k  £  X/}  are  partitions  of  P  and  each 
cell  Pj  &  £  Pi  is  the  disjunct  union  of  cells  on  the  next  finer  level  l  +  1  : 

=  P;+l,r  • 

r£V,k 

Typically  the  partitions  consist  of  cubes  or  simplices  and  the  refinement  sets  In-  have 
a  fixed  cardinality. 

The  hierarchy  of  partitions  induces  an  infinite  master  tree  X*,  whose  root  is  P  and 
whose  other  nodes  are  the  cells  Pj  Each  node  P ^  of  this  tree  is  connected  by  an 
edge  to  its  children  P;+i,r  where  r  £  X/^. 

In  practice  we  only  consider  finite  subtrees  of  some  fixed  depth  L,  i.e.,  I  <  L.  L  can 
be  determined  according  to  several  possible  criteria,  such  as  spatial  resolution  reflected 
by  diamP^fc,  or  by  separation,  which  means  that  each  leaf  P contains  at  most  one 
data  point  from  X. 

Given  x  £  P  we  denote  by  X (x)  the  branchless  subtree  of  X*,  which  only  contains 
the  cells  containing  x: 

X(x)  =  {P,,fc  £  T*  :  x  £  Pz,fc}  . 

The  sparse  occupancy  tree  X ( X )  is  the  tree  we  get  from  the  master  tree  by  cutting 
all  cells  which  do  not  contain  a  training  point;  equivalently  it  is  the  largest  subtree  of 
X*  such  that  all  its  nodes  contain  an  x  £  X: 

T(X)  :=  U{X(x*)  :  X  £  X}  .  (1) 

A  piecewise  constant  approximation  can  now  be  defined  as  follows:  given  a  training 
set  X  =  {r1,...,^}  and  a  test  point  x  we  average  the  values  in  the  leaf  of  the 
branchless  tree  T(X)  n  X(x): 

/>)  =  AWW  e  £(X(A)  n  X(x))}).  (2) 

where  we  use  the  notation 

a(y)  =  —t—  ^2  y  (3) 

card  y  ' 

v  ’  y&y 

to  denote  the  average  of  a  finite  subset  Y  c  M.d' . 

In  other  words,  we  identify  the  maximum  level  cell  in  the  sparse  occupancy  tree  that 
contains  the  test  point  and  then  average  the  values  of  the  training  points  contained  in 
this  cell. 

Remark  1.  Note  that  the  scheme  is  interpolating  (i.e.,  if  a  query  coincides  with  a 
training  point  the  algorithm  returns  the  value  of  this  training  point)  if  each  leaf  of  the 
tree  contains  only  training  sample. 
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2.2  Occupancy  Trees  as  Sorted  Lists 

It  is  possible  to  write  computer  codes,  that  follow  word-by-word  the  above  construction, 

i.e. ,  one  can  implement  a  structure  with  node  elements  and  pointers  to  their  children  to 
represent  the  occupancy  tree.  However,  especially  with  large  data  sets,  this  typically 
does  not  result  in  the  most  efficient  code.  Instead  we  represent  the  occupancy  tree  by 
a  sorted  list  of  strings,  as  described  in  the  following. 

2.2.1  Data  Structures 

We  define  a  string  b  to  be  a  finite  sequence  of  integers:  b  =  (b\,  62, . . . ,  bn),  where  n 
is  the  length  of  the  key.  The  elements  6*  of  a  string  will  also  be  called  characters.  We 
denote  with  (b,  c)  =  (61, . . . ,  bn ,  c)  the  string  that  results  from  appending  an  additional 
number  to  the  sequence.  If  j  <  n  we  write  b|j  for  the  substring  that  consists  of  the 
first  j  elements  of  b:  b|j  =  (b\, ... ,  bj). 

Our  first  aim  is  to  construct  an  invertible  map  of  the  nodes  in  the  master  tree  to 
the  set  of  strings.  We  can  do  this  recursively. 

First  we  map  the  the  root  node  fl  to  the  empty  string  0.  For  each  node  12/  & 
we  prescribe  an  enumeration  of  its  children  f2/+i,fc0) . . . ,  where  r  =  r(l,  k )  is 

the  cardinality  of  Then  we  assign  for  i  =  0, ...  ,r  —  1  the  strings  b(P/+i;/Ci)  = 
(b(P/jfc),i)  to  the  children  of 

Clearly  a  cell  at  level  l  is  mapped  to  a  string  of  length  l  and  the  mapping  of  all 
cells  in  the  master  tree  into  the  set  of  strings  of  length  l  is  injektive.  Therefore,  given 
a  string  b  in  the  range  of  T*,  we  will  use  the  notation  12(b)  to  denote  the  cell  that  is 
mapped  to  the  string  b. 

Hence,  we  can  make  the  following  simple  observations:  if  b  has  length  n  and  j  <  n 
then  12(b)  C  12(b|j).  In  particular,  if  two  strings  b2,b2  of  length  n  >  j  have  the  first  j 
bits  in  common,  but  fej+1  7^  62+1  then  i2(b1|j)  =  12(b2|j)  is  the  finest  cell  that  contains 
both  12(b2)  and  12(b2). 

Finally,  if  x  E  12  and  a  maximum  level  L  of  the  master  tree  is  given,  we  denote 
with  b(x)  the  string  that  is  assigned  to  the  the  finest  level  cell  12  rz,,  which  contains  x: 
x  €  12(b(®))  G  C(T*). 

2.2.2  Algorithm 

The  approximation  defined  by  equation  2  can  now  be  realized  by  the  following  algo¬ 
rithm  which  consist  of  a  training  stage  and  an  evaluation  stage.  Again  we  assume  that 
the  maximum  level  of  the  master  tree  L  is  prescribed.  The  training  stage  consists  of 
the  following  steps: 

1.  For  every  training  point  xl  compute  the  string  b(®*). 

2.  Sort  the  b(®*)  lexicographically.  Note  that  the  lexicographical  ordering  of  the 
nodes  induces  a  new  ordering  of  the  points  in  X.  Without  loss  of  generality  we 
will  assume  in  what  follows  that  the  points  x1  were  already  ordered  in  the  same 
way.  In  the  implementation  one  has  to  store  the  resulting  permutation,  of  course. 

3.  For  all  n  =  1 , ,N  compute  the  running  sums 

n 

Yn  =  YJyi  =  Yn~1+yn  , 

i=  1 
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where  for  convenience  we  set  Y°  =  (0, . . . ,  0). 

Remark  2.  The  computation  of  the  strings  requires  the  generation  of  0(LN )  char¬ 
acters.  The  sorting  operation  can  be  done  in  0(Nlog(N))  time.  The  storage  is  LN 
characters  for  the  strings  plus  N  integers  for  the  permutation  vector  and  O(d'N)  real 
values  for  the  running  sums. 


Now  let  us  assume  that  a  query  point  x  is  given.  Then  in  the  evaluation  stage  we 
have  to  find  the  finest  cell  in  the  occupancy  tree  that  contains  x  and  to  average  the 
points  in  this  cell.  These  points  correspond  to  the  strings  which  share  the  maximum 
number  of  leading  characters  with  b(x)  among  all  strings  in  the  sorted  list.  These 
strings  can  be  identified  as  follows: 

1.  Find  the  position  n,  such  that  b(xn)  <  b(x)  <  b(xn+1),  where  <  denotes  the 
relation  induced  by  the  lexicographical  ordering. 

2.  Compare  b(x)  with  b(x”)  and  b(xn+1).  The  maximum  number  of  leading  char¬ 
acters  is 

j  :=  rna x{j  :  b(x)|j  =  b(xn)| j  V  b(x)|i  =  b(xn+1)|i}  . 


3.  Find  the  position  m  such  that  b(xm_1)  <  b(x)|j  <  b(xm).  Obviously  xm  is  the 
first  point  that  shares  j  characters  with  b(x),  because  b(x)|j  is  the  smallest  string 
that  starts  with  b(x)|j. 

4.  Generate  the  string  b  =  (b,  R,...,R)  by  appending  L  —  j-times  the  maximum 
cardinality  R  of  all  sets  Mitk-  This  is  obviously  the  last  possible  string  that  begins 
with  b(x)|j.  Then  we  search  the  position  p  such  that  b(xp)  <b<  b(xp+1).  xp  is 
then  the  last  point  in  the  list  that  has  to  be  considered  for  the  averaging. 


5. 


The  value  of  the  approximation  can  then  be  computed  by  evaluation  of  the  run¬ 
ning  sums: 


/(*) 


ryp  _  Ym~  l) 
p  —  m  +  ly 


Remark  3.  This  evaluation  algorithm  essentially  consists  of  three  search  algorithms, 
which,  in  the  worst  case,  can  be  performed  in  log(N)  time  each  by  binary  subdivision. 
In  practice  the  effort  is  usually  better,  because  n,  m,  and  p  are  close  together  in  the 
list,  so  that  n  can  be  used  as  good  initial  guess  for  the  other  two  search  operation.  The 
evaluation  of  the  running  sum  requires  only  constant  time.  Hence  the  operation  count 
for  a  single  function  evaluation  is  0(3dlog(iV)). 


Remark  4.  Especially,  if  the  data  sets  are  very  large,  the  method  of  running  sums 
might  become  a  source  of  numerical  inaccuracies  caused  by  cancellation  or  overflow. 
Remedies  for  this  is  to  break  the  running  sum  into  chunks  or  to  use  several  buckets  for 
values  of  different  sign  or  magnitude. 


Remark  5.  It  is  obvious  that  in  a  computer  program  the  above  algorithm  can  be  im¬ 
plemented  in  the  most  efficient  manner  if  the  characters  are  bits  (or  sequences  of  bits); 
in  this  case  the  strings  can  be  represented  by  bitstreams.  This  is  the  case  for  all  binary 
trees  and  for  the  dyadic  subdivision  schemes  which  we  will  consider  in  this  paper. 
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2.3  Cube  Subdivision 


In  this  section  we  assume  that  fl  =  [0,  l]d  is  the  d-dimensional  hypercube.  For  con¬ 
venience  we  will  omit  the  prefix  ‘“hyper”  most  of  the  time  and  speak  of  cubes  and 
cuboids,  although  in  general  d  ^  3.  If  the  data  is  initially  not  contained  in  [0,  l\d,  this 
can  usually  be  achieved  by  rescaling  the  training  data  X  component-wise,  in  particular 
when  the  training  data  is  given  beforehand  instead  of  coming  in  incrementally. 

2.3.1  Dyadic  cube  subdivision 

In  this  variant  each  cube  is  immediately  subdivided  into  its  2d  subcubes,  i.e.  the 
partitions  are  given  by 

d 

Vi  =  {H[ki2-l,(ki  +  l)2-1},  ki  £  {0, . . .  ,2l  —  1}}  . 

i=  1 

That  means,  each  node  in  the  master  tree  has  2d  children. 

2.3.2  Binary  cube  subdivision 

The  cubes  are  halved  one  dimension  after  another.  I.e.  the  partitions  are  given  by 
Vl  =  {H[ki2-li,(kl  +  l)2-%  kte{  0,...,2^-l}, 

i= 1 

Note  that 

•  the  cells  of  the  partitions  are  generally  not  cubes  but  only  rectangles, 

•  that  the  order  of  the  hyperplane  which  are  used  for  the  subdivision  is  not  deter¬ 
mined  adaptively  but  is  prescribed  (otherwise  than  would  not  have  a  well  defined 
master  tree), 

•  and  that  in  contrast  to  dyadic  subdivision  the  underlying  master  tree  is  now  a 
binary  tree. 

2.3.3  Bitstream  Generation 

Cube  subdivision  is  particular  attractive  because  the  generation  of  the  strings  b(x)  is 
very  simple.  Given  an  input  x  =  (xi, . . . ,  x<j)  £  [0,  l]d  we  can  write  its  components  in 
binary  representation  as 

OO 

Xi  =  ^  bik2~k  ■ 
k=  1 

In  the  case  x*  /  1  is  binary  rational,  we  assume  that  the  sequence  {bik}k  ends  with 
zeros,  while  for  Xj  =  1  we  have  bn~  =  1,  k  £  N.  In  both,  dyadic  and  binary  subdivision, 
the  bitstream  assigned  to  a  point  x  is  then 

{bn,  b-2i,  •  •  • ,  bdi,  bn,  622,  •  •  • ,  bd2,  ■  ■  ■ ,  biL,  ^2 L,  •  •  • ,  bdi) 

In  binary  cube  divisions  we  consider  each  single  bit  as  a  character,  whereas  in  dyadic 
subdivision  a  character  consists  of  d  bits.  I.e.,  the  j-th  character  consists  of  the  bits 
b\j, . . . ,  bdj  which  in  the  sense  of  section  2.2.1  can  be  considered  as  binary  representation 
of  an  integer  in  the  range  0, . . .  ,2d~1  corresponding  to  a  certain  enumeration  of  the 
children  of  a  cell  in  a  dyadic  subdivision  scheme. 
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2.4  Random  Shifts 


The  above  recovery  schemes  suffer  from  the  following  fact.  For  any  two  points  x,  x'  £  X 
the  tree  distance  distr(®,  x')  is  the  shortest  path  in  the  tree  T (X)  connecting  the  nodes 
&L,k(x)  3  x  and  3  x'  ■  Of  course,  whereas  \\x— x'\\  may  be  arbitrarily  small  for  any 
fixed  norm  ||  •  ||  on  Mrf,  the  tree  distance  distRx,  x')  could  be  2 L.  The  above  recovery 
scheme  takes  local  averages  of  function  values  whose  tree  distance  is  small  possibly 
omitting  values  for  arguments  that  are  geometrically  very  close.  In  fact,  an  adversary 
effect  on  the  quality  of  the  reconstruction  is  reflected  by  numerical  experiments  that 
will  be  shown  later  below.  There  are  several  possible  remedies.  Since  the  recovery 
scheme  is  very  fast,  the  perhaps  simplest  one  is  to  perform  several  different  recoveries 
with  respect  to  randomly  slightly  shifted  coordinate  systems  and  then  take  the  average 
of  the  outputs. 

In  our  implementation  we  scale  the  data  to  the  interval  [0.3,  0.7],  and  then  shift  the 
data  with  random  vectors  in  [— 0.3,0. 3]d.  Let  fp(x)  denote  the  result  of  a  query  at  x 
with  the  data  shifted  by  the  vector  p  and  Xp(x)  be  the  corresponding  set  of  training 
points  in  the  leaf  of  the  sparse  occupancy  tree  containing  x.  Furthermore  let  R(x )  be 
the  set  of  shifts  p  for  which  the  level  of  the  evaluation  is  maximal.  Then  we  have  tested 
the  following  two  schemes  to  compute  a  result  from  the  random  shifts: 

<4) 

or 

f(x)  =  ^ - \iX  7-xv  (#(Xp(x))fp(xj\  (5) 

Lp6ij(I)#^W)peR(l)v 

The  idea  of  the  second  formula  is  to  weight  the  points  that  occur  in  the  sets  Xp(x) 
according  to  the  number  of  their  appearance  in  these  sets.  A  third  possibility  is  to 
choose  just  one  of  the  p  £  R(x)  randomly  and  to  take  this  result.  We  usually  prefer 
the  first  version. 


3  Sparse  Occupancy  Trees  using  Simplices 

As  has  become  clear  from  the  above  abstract  description  of  a  sparse  occupancy  tree, 
there  are  in  principal  no  restriction  on  the  shape  of  the  elements  of  the  partitions.  For 
the  reasons  that  are  explained  in  the  introduction  we  want  to  build  trees  based  on 
simplices.  This  does  not  offer  any  advantages  over  cube  subdivision  if  we  only  consider 
piecewise  constant  approximations  but  allows  for  extensions  towards  piecewise  linear 
schemes,  which  will  be  the  topic  of  Section  3.3.  First,  however,  we  have  to  go  through 
some  technicalities  concerning  data  preparation  and  the  computation  of  the  bitstrings. 

3.1  Data  Preparation 

To  start  a  simplex  subdivision  scheme  we  have  to  map  all  the  data  points  into  a  simplex, 
here  we  choose  the  standard  simplex 

S  =  {x  £  :  0  <  x\  <  X2  <  . . .  <  Xd  <  1}. 
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If  one  has  data  in  the  unit  cube  [0,  l]rf  this  can  be  achieved  by  the  so-called  root 
transformation  T  :  [0,  l}d  — >  5: 

x  =  (xi)f=  i  ' — »  T(x) 

which  can  be  computed  recursively  by 

T(x)d  =  x1Jd,  T(x)i  =  T(x)i+ixli/\  i  =  d-  1,...,1 

and  has  the  useful  property  that  its  Jacobian  determinant  Jt{x)  =  =  const,  see 

[FY94].  Furthermore  this  transformation  (and  its  inverse)  are  computationally  cheap 
and  numerically  stable.  However,  since  the  transformation  is  singular  on  the  boundary, 
it  makes  sense  to  scale  the  data  initially  to  the  cube  [0.125,  0.875]d. 

Remark  6.  One  should  note,  that  this  step  is  actually  not  without  concern.  Since  the 
partial  derivatives  of  the  mapping  T  vary  over  a  large  range  of  magnitudes  the  metric 
of  the  original  data  is  effectively  distorted.  We  assume  that  this  might  be  the  reason 
for  the  deterioration  of  the  approximation  quality  we  observe  in  some  of  our  numerical 
experiments. 

3.2  Bitstream  Generation 

Next  we  have  to  compute  for  any  data  point  x  its  corresponding  bitstream  b(x).  We 
start  with  the  simplex  S  =  5(0)  and  initialize  the  bitstream  b  =  0.  Then  we  pro¬ 
ceed  successively  with  the  following  bisection  algorithm,  where  we  assume  that  af¬ 
ter  some  steps  of  subdivision  x  is  contained  in  the  simplex  5(b)  with  the  vertices 
vfj  =  0, ...  ,d.  We  assume  that  with  respect  to  these  vertices  x  has  the  barycentric 
coordinates  t(x,  5(b))  =  (to,  . . . ,  rf)  given  by 

d  d 

x  =  '52TjV3,  H  =  1  • 

1= o  1=0 

To  perform  one  bisection  step  we  choose  two  vertices  vk,  vl  and  subdivide  the  edge 
that  connects  these  two  vertices  at  its  midpoint.  Then  we  calculate  the  barycentric 
coordinates  of  x  with  respect  to  the  two  resulting  subsimplices: 

d  d 

X  =  TjV J  TjV i  +  TkVk  +  TlV1 

1=0  j=Q,j^k,l 

=  E  r^  +  2rfe(^)+(r;-rfcy 

j=o,j¥=k,i  v  J 

d 

=  E  TivJ  +  ( Tk  ~  Tl">vk  +  2ti 

j=0,j¥=k,l 

If  all  barycentric  coordinates  of  a  point  are  in  the  range  [0,1],  it  can  be  concluded 
that  x  is  in  the  interior  of  the  simplex.  Therefore,  if  ti  <  r/c  then  x  is  contained  in 


vk  +  vl  \ 

)  ‘ 
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the  subsimplex  connected  to  the  vertex  v\.  In  this  case  we  replace  vk  by  \{vk  +  vl), 
Tfc  by  2 Tfc,  r i  by  (77  —  tj.)  and  add  0  to  the  bitstream  b.  Else  if  77  <  we  replace 
vl  by  \{vk  +  vl),  ti  by  2 77,  Tfc  by  Tk  —  77  and  append  1  to  b.  (In  case  77  =  both 
cases  could  be  applied.  In  order  to  have  uniqueness  of  the  representation,  the  tie-break 
should  be  consistent  and  give  always  the  same  decision.)  Then  we  proceed  with  the 
next  subsection.  It  is  important  to  note  that  only  two  coordinates  are  changed  for 
each  bisection  and  only  one  vertex  is  added  to  the  total  set  of  vertices  produced  in  the 
course  of  the  subdivision  process. 

The  question  is  what  edge  one  wants  subdivide  in  each  step.  In  the  following 
subsection  we  describe  three  schemes,  of  which  we  usually  prefer  the  last  one. 

3.2.1  Longest  Edge  Subdivision 

As  the  name  tells  in  this  scheme  we  subdivide  in  each  step  the  longest  edge  (where 
length  typically  is  measured  in  the  Euclidian  norm)  of  the  current  simplex.  From 
the  geometrical  point  of  view  this  might  be  the  most  desirable  choice  of  subdivision, 
because  this  choice  maximizes  the  size  of  the  minimum  interior  angle  of  the  resulting 
simplices,  resp.  minimizes  the  diameter  of  the  resulting  simplices.  Intuitively  it  is 
clear,  that  especially  the  last  property  might  be  helpful  for  the  approximation,  because 
this  way  one  can  hope  to  minimize  the  distance  between  the  training  points  that  are 
averaged  to  get  the  value  of  a  query  points.  However,  it  requires  some  computation 
to  find  out,  what  currently  the  longest  edge  is,  and  since  the  bitstream  generation 
is  a  major  cost  factor  in  the  training  process,  longest  edge  subdivision  is  practically 
unfeasible. 

3.2.2  Oldest  Edge  Subdivision 

Oldest  edge  subdivision  is  sometimes  used  in  3D  or  4D  grid  generation  systems  as  a 
more  efficient  replacement  of  longest  edge  subdivision.  Here  we  start  with  the  simplex 
5(0)  defined  above  and  enumerate  its  vertices  according  to  the  number  of  coordinates 
that  equal  zero: 

Vi  =  (0, .  „  ,  0,  )• 

i-times  d-i-times 

In  the  first  d  bisection  steps  we  choose  k,l  <  d  such  that  \k  —  l\  is  maximal,  discard  Vk 
or  vi  according  to  the  algorithm  described  above  and  name  the  new  vertex  ^(vk+vl)  =: 
Vd+i-  This  serves  to  subdivide  the  edges  connecting  the  vertices  of  the  initial  simplex 
in  the  order  according  to  length.  After  d  steps  only  one  of  the  original  vertices  is  left, 
and  the  vertices  are  sorted  according  to  creation  time,  i.e.,  if  k  <  l,  then  vk  <  vl . 
Beginning  with  the  d+  1st  subdivision  we  just  subdivide  the  oldest  edge,  i.e.  the  edge 
that  is  connected  by  the  two  oldest  vertices.  Again  we  discard  one  of  the  vertices  from 
the  list  and  name  the  new  vertex  d  +  i. 

3.2.3  An  Alternative  Edge  Subdivision  Rule 

An  alternative  to  oldest  edge  subdivision  can  be  described  as  follows.  We  organize  the 
vertices  into  two  groups.  One  will  consist  of  “old”  vertices,  which  will  be  denoted  by 
v3 ,  as  above,  and  the  other  one  will  be  comprised  of  “new”  vertices  w3 .  The  general 
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form  of  an  intermediate  simplex  arising  in  this  process  will  be 


5(b)  = 


vp ,  vp+1 , . . . ,  vq ,  wq-p+l  ,wq~p+2,...,wd 


where  0  <  p  <  q  <  d.  The  initial  simplex  5(0)  corresponds  to  p  =  0,  q  =  d  and  has 
only  old  vertices.  Every  bisection  will  replace  one  of  the  old  vertices  vp  or  vq  by  the 
new  vertex 

wq~p  :=^(vp  +  vq).  (6) 

After  each  bisection  the  number  of  old  vertices  of  the  new  simplex  decrease  by  one. 
In  analogy  to  dyadic  subdivision  it  takes  d  such  bisections  before  ending  up  with  an 
isotropic  refinement  of  the  initial  simplex. 

In  case  there  is  only  one  old  vertex  with  index  p  =  q,  we  declare  the  end  of  the  level 
and  start  the  next  one  by  reassigning  the  names  of  the  vertices  as  follows:  v°  :=  vp 
and  vk  :=  wk  for  k  =  1, ...,  d  and  continue  with  the  procedure. 

In  general  the  shapes  of  the  simplices  generated  by  this  algorithm  are  slightly  more 
favorable  than  the  shapes  generated  by  oldest  edge  subdivision,  and  the  book  keeping 
is  no  more  expensive,  so  that  this  is  our  method  of  choice  for  simplex  subdivision. 


3.3  Piecewise  Linear  Approximation 

On  simplex  partitions  one  can  define  piecewise  linear  approximations  by  prescribing 
values  for  all  the  vertices  and  interpolating  the  values  of  vertices  connected  to  the  cell 
a  queries  falls  into.  We  have  to  note  however,  that  in  high  dimensions  it  can  hardly  be 
the  aim  to  achieve  quadratic  order  of  convergence  -  this  would  require  that  one  could 
prescribe  highly  accurate  values  for  the  vertices  which  would  be  a  very  hard  task.  The 
real  motivation  for  the  development  of  such  vertex  schemes  is  different.  One  aspect  is, 
that  it  might  facilitate  the  generation  of  globally  continuous  approximations,  resulting 
in  smoother  approximants  than  piecewise  constants  and  diminishing  the  variance  of 
the  approximation,  a  property  that  might  be  of  interest,  in  particular,  for  regression 
problems.  We  do  not  pursue  this  idea  deeply  in  the  current  paper  which  is  exclusively 
concerned  with  data  interpolation.  The  challenge  in  the  construction  of  continuous 
approximants  is  how  to  deal  with  non-uniform,  adaptive  partitions  and  hanging  nodes. 
The  main  reason  for  our  interest  in  vertex  schemes  is,  however,  that  it  provides  the 
means  to  overcome  the  tree  distance  problem,  because  it  will  not  matter  any  more  if 
two  points  separate  early  in  the  occupancy  tree:  all  training  points  near  to  a  query 
will  contribute  to  the  result  via  the  vertices  that  connect  neighboring  cells. 

In  the  following  we  will  concentrate  on  the  construction  of  a  particular  scheme 
which  preserves  the  interpolation  property  of  sparse  occupancy  algorithms  because  this 
method  offered  the  best  compromise  between  computational  efficiency  and  accuracy 
in  our  numerical  experiments.  However,  we  will  introduce  some  notations  and  ideas 
that  allow  for  the  development  of  other  variants  of  piecewise  linear  schemes.  The  main 
choices  in  the  construction  are  the  design  of  the  underlying  occupancy  tree,  i.e.  the 
depth  of  its  branches  which  corresponds  to  the  refinement  of  the  underlying  partition, 
and  how  to  define  values  for  the  vertices,  in  particular  for  the  vertices  that  are  not 
connected  to  occupied  cells,  but  might  be  needed  for  the  evaluation  of  a  query. 
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3.3.1  Notation 

To  describe  the  vertex  algorithms  in  detail  we  introduce  some  notation. 

•  For  any  d-dimensional  simplex  S  we  will  denote  the  set  of  its  d+l  vertices  with 
V(S). 

•  If  x  is  a  point  in  S,  and  v  £  V(<S)  then  t(S,v,x )  is  the  barycentric  weight  of  x 
with  respect  to  v.  These  weights  are  defined  by  the  equations 

x=  Y  r(S,v,x)v,  Y  r(S,V,x)  =  l.  (7) 

^eV(S)  veV(S) 

•  We  consider  S'(0)  to  be  the  level  0  simplex.  A  level  /  simplex  is  a  simplex  that 
emerges  from  a  level  l  —  1  simplex  by  exactly  d-binary  subdivisions  with  the 
subdivision  rule  described  in  section  3.2.3,  i.e.,  we  base  our  linear  approximation 
on  a  dyadic  tree. 

•  With  Si(x)  we  denote  the  level-1  simplex  of  the  master  tree  in  which  the  data 
point  x  lies. 

•  Let  T  be  an  occupancy  tree.  Then  the  set  of  simplices  on  level  /  in  this  occupancy 
tree  is  denoted  by  Si(T). 

•  A  level-/  vertex  is  a  corner  point  of  a  level-/  simplex.  Note,  that  a  vertex  can 
belong  to  several  levels.  Furthermore 

Vi(T)=  |J  V(S)  (8) 

sest(T) 

is  the  set  of  all  level-/  vertices  connected  to  a  level-/  simplex  of  the  tree  T. 

•  If  v  £  Vi(T),  then  Si(T ,  v)  C  Si(T)  is  the  set  of  level-/  simplices  in  the  occupancy 
tree  T  which  are  adjacent  to  v. 

•  If  in  the  subdivision  process  a  vertex  v  emerges  as  average  of  the  vertices  w1  and 
w 2  we  write  w1  =  pi(v)  and  w2  =  P2(v). 

3.3.2  Principle  of  the  Approximation 

Let  T  =  T(X)  be  a  finite  simplex-based  occupancy  tree.  This  means  that  each  node 
Qkj  in  the  tree  contains  a  training  point.  For  the  moment  we  do  not  prescribe  a  certain 
maximum  depth  L  for  the  tree  and  keep  open  the  option  that  different  branches  may 
have  different  depth. 

In  the  training  stage  of  the  piecewise  linear  approximation  we  compute  for  all  levels 
/  and  all  vertices  v  £  V;(T)  the  value 

yl(v)  =  A({yi\xie  |J  5}).  (9) 

sest(T,v) 

Note  that  according  to  the  above  definitions  the  average  is  not  taken  for  an  empty  set. 
If  v  (f  Vi(T )  we  define  its  (unweighted)  value  recursively  by  averaging  the  values  of  its 
parents: 

Vi{v )  =  yi-i(pi(v ))  +  (10) 
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This  recursion  terminates,  because  the  level-0  values  of  the  the  vertices  of  >S(0)  are 
defined,  if  the  training  data  set  is  not  empty. 

In  the  evaluation  stage  the  level-/  value  of  a  query  point  x  is  then  determined  by 
piecewise  linear  interpolation  of  the  vertex  values  of  the  level-/  cell  in  the  master  tree 
the  query  falls  into,  concretely 

fl(x)=  ^2  r(Si(x),v,x)  yi(v).  (11) 

v£V(Si(x)) 

Note,  that  the  description  of  this  algorithm  only  becomes  complete,  when  we  define 
how  exactly  we  construct  T,  i.e. ,  how  deep  we  refine  the  branches  of  the  occupancy 
tree.  Second  we  have  to  decide,  which  of  the  various  level- values  fi(x)  shall  become 
the  result  f(x)  of  the  query.  In  particular  the  above  approximation  is  not  necessarily 
continuous  and  not  necessarily  interpolating.  However,  this  can  be  enforced  by  the  right 
choice  of  T  and  /.  Furthermore  note,  that  this  algorithm  is  suited  for  online-learning 
purposes:  if  one  gets  a  new  sample,  one  just  computes  its  place  in  the  occupancy  tree 
and  adds  its  value  to  all  adjacent  vertices.  After  that  a  query  can  immediately  use  the 
new  information. 

3.3.3  Special  Schemes 

As  mentioned  before  we  mainly  aim  at  an  interpolating  scheme.  This  property  can  be 
enforced  by  choosing  the  underlying  occupancy  tree  such  that  no  two  leafs  of  T  join 
at  a  common  vertex  and  by  choosing  the  evaluation  level  /  such  that  it  is  equal  or 
larger  than  the  level  of  the  last  occupied  cell  the  query  falls  into.  In  the  experiments 
of  Section  4  we  used  a  version  that  chose  /  as  the  smallest  cell  in  the  master  tree,  that 
was  still  connected  to  at  least  one  vertex  of  the  occupancy  tree.  With  this  choices  it  is 
obviously  guaranteed  that  a  query  coincident  with  a  training  sample  returns  the  value 
of  this  training  point,  because  all  vertices  of  the  evaluation  simplex  are  influenced  by 
this  training  sample  only.  Typically  vertex  separating  trees  become  rather  deep,  which 
might  become  unpractical  if  memory  limitations  have  to  be  observed. 

Therefore  we  also  experimented  with  minimal  separating  trees,  i.e.,  we  chose  T  to  be 
the  smallest  tree,  such  that  T  contains  only  one  training  point.  However,  we  observed 
a  severe  decrease  of  accuracy  in  this  case,  so  that  we  disregarded  this  non-interpolating 
approach. 

The  easiest  (but  perhaps  not  best)  method  to  enforce  global  continuity  of  the  ap¬ 
proximation  is  to  perform  all  evaluations  at  the  same  level  /.  This  would  just  mean  to 
define  a  piecewise  linear  function  on  a  uniform  partition.  The  disadvantage  of  this  sim¬ 
plistic  approach  is,  that  for  highly  non-uniformly  distributed  data,  it  requires  frequent 
use  of  vertex  values  that  are  not  defined  by  training  points  nearby,  but  by  the  recursion 
(10).  This,  however,  decreases  the  accuracy  because  it  increases  the  probability  that 
information  is  taken  from  data  points  which  are  far  away  from  the  query  location. 

3.3.4  Variants 

Furthermore  we  have  tested  the  following  modification  of  the  algorithm: 

•  Weighted  vertex  values:  compute  the  vertex  values  not  just  by  averaging  but 
take  the  distances  (or  the  barycentric  weights)  of  the  data  points  to  the  vertices 
in  account. 


13 


•  Best  vertices:  in  the  evaluation  stage  (Equation  11)  do  not  sum  up  over  all  vertices 
of  the  simplex,  but  only  over  the  vertices  that  have  been  assigned  values  on  level 
l : 

=  E»ev(5i(x))nvi(T)  b(Si(x),v ,  x)gi(v) 

Si;eV(Sj(x))nV((T)  K^l(x)  >  vi  x) ■ 


None  of  these  modifications  delivers  a  significant  improvement  in  the  numerical  exper¬ 
iments  we  performed;  therefore  we  do  not  present  results  for  them. 


4  Numerical  Results 


In  this  Section  we  demonstrate  the  performance  of  the  above  described  schemes  with 
some  numerical  results.  As  testcases  we  have  chosen  the  examples  designed  by  Fried¬ 
man  in  [Fri91]  since  they  are  relatively  well-known.  A  limitation  of  these  examples  is 
that  the  x-data  is  always  supposed  to  be  uniformly  distributed.  Since  in  many  practical 
situations  the  input  data  is  correlated  or  otherwise  restricted  to  some  submanifold  of 
the  formal  input  space,  we  have  designed  one  testcase  of  our  own  in  order  to  cover  this 
situation,  too. 

In  all  examples  the  set  up  is  as  follows:  First  we  generate  a  test  data  set  of  M  = 
1,000,000  points,  and  then  various  training  data  sets  N  points,  where  N  =  10e  with 
e  =  3, 4, 5,  6.  This  allows  us  to  get  some  insight  into  the  convergence  behavior  of  the 
schemes.  We  measure  the  accuracy  of  the  approximation  using  the  root  mean  square 
error 


RAISE 


\ 


M 


M 


5 ^(/(z* )  -  /(z4))2 


1=1 


of  the  test  set  which  is  essentially  a  Monte-Carlo  approximation  of  the  L2-error  of  the 
approximation. 

It  is  clear,  that  in  literature  (for  instance  [MLH03])  one  easily  finds  methods  like 
CART,  support  vector  machines  or  neural  networks,  which  achieve  better  accuracy 
for  the  Friedman  problems  than  the  methods  analyzed  here.  But  these  schemes  are 
outside  the  scope  of  the  current  work  because  they  use  the  distribution  of  the  y-data 
in  their  training  processes.  The  nearest  neighbor  and  sparse  occupancy  tree  recovery 
schemes  described  in  this  paper  can  be  characterized  as  senri-adaptive  schemes  since 
they  all  use  only  the  x-data  in  order  to  decide  how  to  partition  the  input  space.  It 
therefore  seems  appropriate  to  restrict  the  comparison  to  the  relative  performance  of 
such  related  schemes. 


4.1  Friedman  1  Data  Set 

In  this  test  case  we  approximate  the  function 

y(xi, . . . , xio)  =  10 sin(7rxiX2)  +  20(x3  —  0.5)2  +  IOX4  +  5xs-  (12) 

Hereby  the  xi,...,xio  are  uniformly  distributed  over  the  ranges  0  <  x*  <  1.  The 
variables  xq,  . . .  xio  clearly  do  not  contribute  to  the  y- values  which  causes  a  deteriora¬ 
tion  of  the  convergence,  since  the  senri-adaptive  schemes  have  no  means  to  detect  that 
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N 

1,000 

10,000, 

100,000 

1,000,000 

/c-nearest  neighbor 

k 

1 

3.45642 

2.72654 

2.16177 

1.70666 

5 

2.5381 

1.87911 

1.41927 

1.07411 

10 

2.51983 

1.81674 

1.34074 

0.990133 

20 

2.6283 

1.86574 

1.35252 

0.98068 

opt-k 

8 

10 

12 

16 

opt 

2.50911 

1.81674 

1.33621 

0.976907 

Sparse  Occupancy  Trees 

Dyadic  Cubes 

4.63352 

3.36706 

3.14852 

2.65411 

Binary  Cubes 

4.41309 

3.2352 

2.30208 

2.24182 

Simplices 

5.13561 

4.04371 

3.55441 

3.22269 

Random  Shifts  (Dyad 

lie  Cubes) 

10 

3.27477 

2.56207 

1.74117 

1.33433 

50 

3.18321 

2.49552 

1.62769 

1.1639 

100 

3.11513 

2.42343 

1.67022 

1.18687 

Vertex  Algorit’ 

hm 

3.18713 

2.49601 

1.9143 

1.48756 

No.  Vertices 

21,086 

195,418 

2,010,754 

19,206,866 

Table  1:  Results  for  the  ten-dimensional  Friedman  1  example. 


these  inputs  are  irrelevant.  In  order  to  quantify  these  effect,  we  repeat  the  experiment 
without  the  extra  dimensions.  In  both  cases  the  test  set  has  a  variance  of  4.87G642. 

In  Table  1  which  displays  the  residual  mean  square  errors  calculated  with  nearest 
neighbors,  sparse  occupancy  trees,  random  shifts  and  the  interpolating  piecewise  lin¬ 
ear  vertex  algorithm,  one  can  make  the  following  observations.  The  piecewise  constant 
approximation  with  sparse  occupancy  trees  clearly  is  not  competitive  with  regard  to  ap¬ 
proximation  accuracy.  However,  random  shifts  significantly  improve  the  performance. 
Applying  a  moderate  number  of  random  shifts  clearly  outperforms  the  1-nearest  neigh¬ 
bor  method,  although  it  is  still  clearly  worse  than  the  A;-nearest  neighbor  method  with 
an  optimally  chosen  k.  In  this  particular  example  binary  splitting  of  cubes  works 
significantly  better  than  dyadic  splitting.  This  is  explained  by  the  fact,  that  the  super¬ 
fluous  variables  come  last  in  the  splitting.  That  means,  that  the  superfluous  splits  in 
these  directions  have  less  influence  on  the  tree  structure  and  the  averaging  procedure. 
The  approximation  with  piecewise  constant  approximation  on  simplices  is  significantly 
worse  than  the  cube  versionconfirming  the  suspicions  we  formulated  about  the  data 
preparation  step  in  Section  3.1. 

It  is  clear,  that  the  accuracy  of  he  answer  to  a  single  query  depends  significantly  on 
the  level  on  which  the  query  is  evaluated.  This  is  also  confirmed  by  Table  2  which  shows 
on  which  level  the  queries  were  evaluated  in  the  case  N  =  106.  As  already  explained 
in  the  introduction  one  cannot  expect  to  much  from  any  partitioning  scheme  in  this 
particular  example,  because  the  data  is  uniformly  distributed  such  that  the  training 
samples  separate  already  on  low  levels.  However,  random  shifts  have  a  significant  effect 
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on  this  statistics.  In  this  case  the  majority  of  evaluations  is  pushed  from  level  2  to 
level  3  and  no  evaluations  are  performed  on  level  1  or  2  any  more. 


single  tree 

100  random  shifts 

level 

queries 

average  error 

queries 

average  error 

1 

383762 

3.20158 

0 

2 

615339 

2.24789 

0 

3 

899 

1.2386 

743241 

1.08021 

4 

0 

256363 

1.4529 

5 

0 

395 

0.7784 

6 

0 

1 

0.593573 

Table  2:  Statistics  of  evaluation  levels  and  corresponding  errors. 


For  comparison  Table  3  lists  the  results  if  one  removes  xq,..  .x\q  from  the  input 
data.  As  expected  the  difference  between  the  dyadic  and  the  binary  tree  algorithm  be¬ 
comes  insignificant.  In  this  setting  the  vertex  algorithm  achieves  comparable  accuracy 
as  the  nearest  neighbor  algorithm.  Somewhat  unexpected  is  that  the  convergence  of 
the  random  shifts  algorithm  seems  to  almost  stall  when  one  compares  the  numbers  for 
N  =  105  and  N  =  10b.  We  have  not  observed  such  behavior  in  any  other  test  case  and 
have  no  explanation  for  it. 


N 

1,000 

10,000, 

100,000 

1,000,000 

/^-nearest  neighbors 

k 

1 

1.88274 

1.17671 

0.734954 

0.460371 

5 

1.38843 

0.809425 

0.47469 

0.28492 

10 

1.41282 

0.792543 

0.442179 

0.253858 

20 

1.56399 

0.848696 

0.452497 

0.24638 

opt-k 

6 

8 

12 

17 

opt 

1.38235 

0.789593 

0.440825 

0.245935 

Sparse  Occupancy  3 

frees 

Dyadic  Cubes 

2.72086 

1.67133 

1.04174 

0.708435 

Binary  Cubes 

2.54749 

1.66814 

1.04311 

0.644786 

Binary  Simplices 

3.29683 

2.14322 

1.43075 

0.958765 

Random  Shi 

fts  (Dyadic  Cubes) 

10 

1.54714 

0.95336 

0.626991 

0.615749 

50 

1.49284 

0.829795 

0.571043 

0.534478 

100 

1.52529 

0.826601 

0.5509 

0.512936 

Vertex  Algorithm 

1.54153 

0.88528 

0.511434 

0.289492 

No.  Vertices 

10,782 

103,758 

1,041,575 

10,042,198 

Table  3:  Results  for  5-dimensional  Friedman  1  example. 
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4.2  Friedman  2  Data  Set 

In  this  case  we  approximate  the  function 


y(x  i,...,x4) 


X2X3 


X2X4 


with  the  four  variables  uniformly  distributed  over  the  ranges  0  <  x\  <  100,  407T  < 
X2  <  5607T,  0  <  X3  <  1,  and  1  <  X4  <  11.  The  function  has  variance  3752. 


N 

1,000 

10,000, 

100,000 

1,000,000 

fc-nearest  neighbor 

k 

1 

83.1134 

45.6397 

18.9724 

13.9701 

5 

62.5134 

30.4696 

15.925 

8.63483 

10 

65.5309 

28.6937 

14.2602 

7.41368 

20 

76.067 

30.1256 

13.748 

6.71658 

opt-k 

5 

11 

18 

31 

opt 

62.5134 

28.683 

13.7265 

6.58647 

Sparse  Occupancy  Trees 

Dyadic  Cubes 

115.894 

65.4261 

37.8324 

21.219 

Binary  Cubes 

119.475 

65.1913 

35.9613 

20.854 

Simplices 

142.638 

84.5237 

47.9805 

27.8929 

Random  Shifts  (Dyadic  Cubes) 

10 

75.4826 

38.8928 

19.7915 

14.5742 

50 

66.2124 

35.9743 

19.5236 

10.5921 

100 

65.304 

35.257 

19.6445 

10.4952 

Vertex  Algorithm 

65.244 

29.7608 

14.0904 

7.02889 

No.  Vertices 

8911 

86303 

837,773 

8,225,310 

Table  4:  Results  for  Friedman  2  Example 


This  test  case  merely  confirms  the  conclusions  outlined  in  the  previous  sections. 
The  last  time  of  the  table  lists  how  many  vertices  have  been  assigned  values  in  the 
training  stage  of  the  vertex  algorithm.  This  information  is  relevant  because  the  exten¬ 
sive  memory  consumption  of  the  vertex  algorithm  seems  to  be  its  major  disadvantage 
for  its  practical  application.  In  this  case  the  trained  vertex  tree  needs  about  8  —  9  times 
more  memory  than  the  incoming  data.  However,  this  seems  still  more  economical  than 
storing,  say,  50  or  100  randomly  shifted  occupancy  trees,  so  that  the  vertex  algorithm 
is  surely  more  memory  efficient  than  the  random  shift  algorithm. 
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4.3  Friedman  3  Data  Set 

Here 

,  x  ,  -i  f  x2x3  -  (x2x4y1\ 

y{x i, . . .  ,.t4)  =  tan  - 

V  *i  J 

with  0  <  x\  <  100  ,  407T  <  x2  <  5607T,  0  <  x3  <  1,  and  1  <  x4  <  11.  The  variance 
of  the  test  data  set  is  0.3165252.  Note  that  this  function  has  a  very  steep  gradient  if 
X\  — >  0  and  almost  jumps  from  — 7r/2  to  tt/2  when  the  numerator  changes  sign. 


N 

1,000 

10,000, 

100,000 

1,000,000 

fc-nearest  neighbor 

k 

1 

0.155238 

0.103645 

0.0692366 

0.0439895 

5 

0.135107 

0.0868966 

0.0556544 

0.0348372 

10 

0.140372 

0.0905709 

0.0570974 

0.0355295 

20 

0.153082 

0.0978829 

0.0625607 

0.0386026 

opt-k 

5 

5 

5 

6 

opt 

0.135107 

0.0868966 

0.0556544 

0.0347881 

Sparse  Occupancy  Trees 

Dyadic  Cubes 

0.202478 

0.116544 

0.0811854 

0.0602853 

Binary  Cubes 

0.215195 

0.118146 

0.0843254 

0.064841 

Simplices 

0.176661 

0.11912 

0.0887659 

0.0491299 

Random  Shifts  (Dyadic  Cubes) 

10 

0.140396 

0.094879 

0.0585261 

0.0362686 

50 

0.139053 

0.0922174 

0.060494 

0.0368731 

100 

0.140396 

0.09131 

0.0608876 

0.0375257 

Vertex  Algorithm 

0.0960549 

0.0601811 

0.0336853 

0.0199746 

No.  Vertices 

8,911 

86,303 

837,773 

8,225,310 

Table  5:  Results  for  Friedman  3  Example 


In  Table  5  we  see  that  the  optimal  number  of  nearest  neighbors  almost  does  not 
increase  when  N  is  growing,  indicating  that  the  target  function  indeed  is  not  very 
smooth.  In  this  case  the  vertex  algorithm  performs  even  better  than  the  optimal  nearest 
neighbor  algorithm.  This  can  be  explained  by  its  interpolation  property,  which  is  very 
helpful  here  due  to  the  nearly  discontinuous  behavior  of  the  function.  Furthermore  one 
might  assume  that  the  piecewise  linear  approximant  might  improve  accuracy  in  regions 
of  steep  gradients. 
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4.4  Data  on  Submanifold 


In  this  example  we  consider  samples  of  the  smooth  function  f(x)  =  sin(27r(xi+-  •  -+3^)) 
for  d  =  20  and  sample  sites  X  that  are  uniformly  distributed  on  a  randomly  chosen 
5-dinrensional  sphere  in  [0,  l]20. 


N 

1,000 

10,000, 

100,000 

1,000,000 

7-nearest  neighbor 

1 

0.629136 

0.376805 

0.218324 

0.123483 

5 

0.547673 

0.280663 

0.143379 

0.0765174 

10 

0.590996 

0.293347 

0.136615 

0.0671632 

20 

0.648708 

0.348594 

0.147633 

0.0644286 

opt-k 

3 

5 

9 

18 

opt 

0.539988 

0.280663 

0.136575 

0.0643436 

Sparse  Occupancy  Trees 

Dyadic  Cubes 

0.66076 

0.461315 

0.284637 

0.165377 

Binary  Cubes 

0.699489 

0.474227 

0.28751 

0.166394 

Simplices 

0.667239 

0.502764 

0.354782 

0.230808 

1 

Random  Shifts  (Dyadic  Cubes) 

10 

0.638401 

0.408558 

0.233137 

0.116529 

50 

0.584682 

0.327386 

0.186034 

0.108537 

100 

0.583817 

0.314226 

0.174883 

0.103498 

Vertex  Algorithm 

0.501392 

0.294926 

0.150229 

0.0667254 

No.  Vertices 

53,271 

516,167 

5,085,728 

50,918,786 

Table  6:  Results  for  Data  on  Submanifold  Example 


Standard  estimates  for  piecewise  constant  approximation  predict  that  for  a  smooth 
function  y  =  f(x)  the  L2-error  on  a  non-adaptive  partition  is  proportional  to  N ~1/D^ 
where  D  is  the  dimension  of  the  manifold  from  which  the  samples  are  drawn.  Therefore 
we  compute  the  quantity 


i,  =  jggww  (13) 

\o%(RMSEi/ RMSE2) 

where  N\,N2  are  the  number  of  training  points  and  RMSE\ ,  RMSE2  are  the  calcu¬ 
lated  root  mean  square  errors  of  two  experiments  with  the  same  numerical  scheme  as 
measure  for  the  convergence  speed.  In  this  current  example  we,  of  course,  except  values 
for  D  around  5.  Indeed  we  observe  values  between  4.5  and  6.5  for  the  sparse  occupancy 
trees,  about  3.5  to  4.5  for  the  nearest  neighbor  algorithm  and  between  3  and  4.5  for 
the  vertex  algorithm.  Even  if  these  numbers  are  not  too  reliable,  they  clearly  indicate 
that  all  the  schemes  indeed  capture  the  submanifold,  and  do  not  converge  with  the 
rate  which  would  be  expected  for  a  real  20-dinrensional  problem. 
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4.5  Comparison  of  CPU  Times 

To  give  an  impression  of  computational  efficiency  of  the  scheme  the  following  table 
shows  the  computational  times  (in  seconds)  for  processing  a  test  and  a  training  data 
set  of  106  points  both  for  the  submanifold  example,  measured  on  a  computer  with  2.3 
GHz  AMD  opteron  processor.  The  comparison  with  the  fc-nearest  neighbors  algorithm 
is  not  absolutely  fair  since  we  used  a  brute  force  algorithm  and  in  twenty  dimensions 
much  faster  approximate  nearest  neighbor  methods  are  available,  but  we  did  not  have 
access  to  a  good  implementation. 


Cube  Tree 

Vertices 

k-NN 

CART 

Training  Stage 

6.6s 

220s 

0s 

100s 

Evaluation  Stage 

14s 

334s 

65653s 

32s 

Therefore,  to  keep  a  realistic  image,  we  add  the  timings  found  for  a  standard  CART 
algorithm  (see  [BFOS84])  with  a  single,  unpruned  tree.  The  generation  of  regression 
trees  with  the  CART  algorithm  can  be  considered  as  computationally  efficient  but  it 
is  not  easily  adapted  to  online  learning.  Furthermore  one  has  to  note,  that  one  usually 
has  to  generate  a  larger  number  of  occupancy  trees  or  regression  trees  (like  in  the 
random  forest  algorithm  [BreOl])  in  order  to  achieve  good  accuracy. 


5  Conclusion 

In  this  paper  we  presented  algorithms  for  high  dimensional  approximation  based  on 
sparse  occupancy  trees  which  are  well  suited  for  large  data  sets  and  online  learning. 
It  was  demonstrated  that  they  offer  a  viable  alternative  to  the  fc-nearest  neighbor 
approximation.  In  particular  the  vertex  algorithm  seems  to  have  further  potential, 
since  in  some  application  it  already  outperforms  classical  approximation  methods  and 
there  are  several  direction  in  which  one  can  search  for  improvements,  notably  with 
regard  to  the  construction  of  globally  continuous  approximants. 
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